Massively parallel processor including slice-wise communications arrangement



A massively parallel processing system comprising
a plurality of processing nodes controlled in
parallel by a controller. The processing nodes are
interconnected by a plurality of communications
links. Each processing node comprises a memory, a
transposer module and a router node. The memory
stores data in slice format. The transposer module is
connected to the memory and generates transpose
data words of selected ones of the data slices from
the memory. The router node is connected to the
transposer module and to the communications links
and transfers transpose data words over the
communications links to thereby transfer the data
slices between processing nodes. Finally, the controller
controls the memories, transposer modules and
router nodes of the processing nodes in parallel, to
facilitate transfer of data slices among the
processing nodes in unison.

INCORPORATION BY REFERENCE

U. S. Patent No. 4,598,400, issued July 1,
1986, to W. Daniel Hillis, for Method and Apparatus
For Routing Message Packets, and assigned to the 
assignee of the present application, incorporated 
herein by reference. 

U. S. Patent No. 4,814,973, issued March 21,
1989, to W. Daniel Hillis, for Parallel Processor,
and assigned to the assignee of the present ap- 
plication, incorporated herein by reference. 

U. S. Patent Application Ser. No. 07/043,126,
filed April 27, 1987, by W. Daniel Hillis, et al., for
Method and Apparatus For Routing Message Pack- 
ets, and assigned to the assignee of the present 
application, incorporated herein by reference. 

U. S. Patent Application Ser. No. 07/179,020,
filed April 8, 1988, by Brewster Kahle, et al., for
Method and Apparatus For Interfacing Parallel Pro- 
cessors To A Co-Processor, and assigned to the 
assignee of the present application, incorporated 
herein by reference. 

FIELD OF THE INVENTION 

The invention relates generally to the field of 
massively parallel computer systems, and more 
particularly to communications arrangements for 
transferring data among processing nodes in such 
systems. 

BACKGROUND OF THE INVENTION 

A computer system generally includes one or 
more processors, a memory and an input/output 
system. The memory stores data and instructions 
for processing the data. The processor(s) process 
the data in accordance with the instructions, and 
store the processed data in the memory. The 
input/output system facilitates loading of data and 
instructions into the system, and obtaining pro- 
cessed data from the system. 

Most modern computer systems have been
designed around a "von Neumann" paradigm, under
which each processor has a program counter
that identifies the location in the memory which
contains its (the processor's) next instruction.
During execution of an instruction, the processor 
increments the program counter to identify the 
location of the next instruction to be processed. 
Processors in such a system may share data and 
instructions; however, to avoid interfering with each 
other in an undesirable manner, such systems are 
typically configured so that the processors process 
separate instruction streams, that is, separate se- 
ries of instructions, and sometimes complex proce- 
dures are provided to ensure that processors' ac- 
cess to the data is orderly. 



In von Neumann machines, instructions in one
instruction stream are used to process data in a
single data stream. Such machines are typically
referred to as SISD (single instruction/single data)
machines if they have one processor, or MIMD
(multiple instruction/multiple data) machines if they
have multiple processors. In a number of types of
computations, such as processing of arrays of data,
the same instruction stream may be used to process
data in a number of data streams. For these
computations, SISD machines would iteratively perform
the same operation or series of operations on
the data in each data stream. Recently, single
instruction/multiple data (SIMD) machines have
been developed which process the data in all of
the data streams in parallel. Since SIMD machines
process all of the data streams in parallel, such
problems can be processed much more quickly
than in SISD machines, and at lower cost than with
MIMD machines providing the same degree of
parallelism.

The aforementioned Hillis patents and Hillis, et
al., patent application disclose an SIMD machine
which includes a host computer, a micro-controller
and an array of processing elements, each including
a bit-serial processor and a memory. The host
computer, inter alia, generates commands which
are transmitted to the micro-controller. In response
to a command, the micro-controller transmits one
or more SIMD instructions to the array, each SIMD
instruction enabling all of the processing elements
to perform the same operation in connection with
data stored in the elements' memories.

The array disclosed in the Hillis patents and
Hillis, et al., patent application also includes two
communications mechanisms which facilitate transfer
of data among the processing elements. One
mechanism enables each processing element to
selectively transmit data to one of its four
nearest-neighbor processing elements. The second
mechanism, a global router interconnecting integrated
circuit chips housing the processing elements in a
hypercube, enables any processing element to
transmit data to any other processing element in
the system. In the first mechanism, termed
"NEWS" (for the North, East, West and South
directions in which a processing element may
transmit data, if the processing elements are
considered arranged in a two-dimensional array), the
micro-controller enables all of the processing
elements to transmit, and to receive, bit-serial data in
unison, from the selected neighbor. More recently,
arrays have been developed in which "NEWS"-type
mechanisms facilitate transfer of data in unison
among processing elements that are considered
arranged in a three-dimensional array.

On the other hand, in the global router, the
data is transmitted in the form of messages, with
each message containing an address that identifies
the processing element to receive the data. The
micro-controller enables the processing elements
to transmit messages, in bit-serial format, through
the global router in unison, and controls the timing
of the global router, but it does not control the
destination of the message, as it does in the
NEWS mechanism. However, the address, and other
message protocol information that may be transmitted
in the message, represent overhead that
reduces the rate at which data can be transmitted.

As noted above, the arrays disclosed in the 
Hillis patents and Hillis patent application include 
bit-serial processors. These processors process 
successive bits of data serially. More recently,
processor arrays have been developed which, in
addition to the bit-serial processors, also include
co-processors which process data in word-parallel
format. Each of the co-processors is connected to
a predetermined number of the bit-serial processors
to form a processing node. The aforementioned
Kahle, et al., patent application describes an
arrangement for connecting such co-processors in
the array.

SUMMARY OF THE INVENTION 

The invention provides a new and improved 
communications arrangement for facilitating trans- 
fers of data among processing nodes in a proces- 
sor array. 

In brief summary, the invention provides a 
massively parallel processing system comprising a 
plurality of processing nodes controlled in parallel 
by a controller. The processing nodes are intercon- 
nected by a plurality of communications links. Each 
processing node comprises a memory, a transposer
module and a router node. The memory
stores data in slice format. The transposer module
is connected to the memory and generates transpose
data words of selected ones of the data slices
from the memory. The router node is connected to 
the transposer module and to the communications 
links and transfers transpose data words over the 
communications links to thereby transfer the data 
slices between processing nodes. Finally, the con- 
troller controls the memories, transposer modules 
and router nodes of the processing nodes in par- 
allel, to facilitate transfer of data slices among the 
processing nodes in unison. 

BRIEF DESCRIPTION OF THE DRAWINGS 

This invention is pointed out with particularity
in the appended claims. The above and further
advantages of this invention may be better understood
by referring to the following description taken
in conjunction with the accompanying drawings, in
which:

Fig. 1 is a block diagram of a portion of a
computer system incorporating a communication
arrangement in accordance with the invention;
Figs. 2A and 2B are flow diagrams useful in
understanding the operation of the new
communication arrangement; and
Figs. 3A and 3B are diagrams depicting data
structures useful in understanding the operation
of the new communication arrangement.

DETAILED DESCRIPTION OF AN ILLUSTRATIVE 
EMBODIMENT 

Fig. 1 is a block diagram of a portion of a
computer system incorporating a communication
arrangement in accordance with the invention. The
computer system includes a micro-controller 5,
which is controlled by a host 6 and which, in turn,
controls an array of processing nodes, one of
which, namely, processing node 10, is shown in
Fig. 1. To accomplish processing, the host computer
6 transmits commands to the micro-controller
5. In response to a command, the micro-controller
5 may transmit one or more instructions or other
sets of control signals which control processing
and other operations, in parallel, to all of the
processing nodes concurrently. In addition, a number
of processing nodes 10 are interconnected, as
described in the aforementioned Hillis patents, Hillis,
et al., patent application, and Kahle, et al., patent
application, to facilitate the transfer of data among
the processing nodes 10.

With reference to Fig. 1, processing node 10
includes two processing element (PE) chips 11H
and 11L (generally identified by reference numeral
11) connected to a memory 12 over a data bus 13.
In one embodiment the data bus includes thirty-two
data lines D(31:0) which are divided into high-order
data lines D(31:16), which connect to PE chip
11H, and low-order data lines D(15:0), which connect
to PE chip 11L. Each PE chip 11 includes a
set of serial processors, generally identified by
reference numeral 14, and a router node, generally
identified by reference numeral 15. The serial
processors operate in response to SP INSTR serial
processor instruction signals from the micro-controller
5 to perform processing on data stored in
the memory 12. The memory 12 operates in
response to MEM ADRS memory address signals,
which identify storage locations in the memory 12,
and MEM CTRL memory control signals which
indicate whether data is to be stored in or transmitted
from the location identified by the MEM ADRS
memory address signals. Both the MEM ADRS
memory address signals and the MEM CTRL
memory control signals are provided by the
micro-controller 5. The router nodes 15 also operate in
response to RTR CTRL router control signals, also
from the micro-controller 5, to transmit messages
containing data from one processing node 10 to
another.

In one embodiment each PE chip 11 includes
sixteen serial processors 14, each of which is
associated with one of the data lines of the data bus 13.
That is, each serial processor 14 receives data bits
from, and transmits data bits onto, one of the data
lines D(i) ["i" is an integer from the set (31,...,0)].
The memory 12 has storage locations organized
into thirty-two bit slices, with each slice being
identified by a particular binary-encoded value of the
MEM ADRS memory address signals from the
micro-controller 5. If data is to be transmitted from
a slice in memory identified by a particular value of
the MEM ADRS memory address signals, the
memory 12 will transmit bits 31 through 0 of the
slice onto data lines D(31) through D(0), respectively.
On the other hand, if data is to be loaded
into a slice in memory identified by a particular
value of the MEM ADRS memory address signals,
the memory 12 will receive bits 31 through 0
from data lines D(31) through D(0), respectively,
and load them into respective bits of the slice.

To perform processing on multi-bit words of
data in the memory 12 using the serial processors
14, the micro-controller 5 iteratively generates
MEM ADRS memory address signals whose values
identify successive locations in memory 12,
MEM CTRL memory control signals which enable
the memory 12 to transmit or store slices of data,
and SP INSTR serial processor instruction signals
which enable the serial processors 14 to perform
the required operations on the bits on their
associated data lines D(i). The data in the memory 12
thus may be viewed in two ways, namely, (i) a slice
view, identified by the arrow labeled "SLICE,"
representing fixed-size words of data ("data slices")
that will be transmitted from the memory onto the
data bus 13, or that will be received by the memory
from the data bus 13, at one time in response
to the MEM ADRS memory address signals, and
(ii) a processor view, identified by the arrow labeled
"PROCESSOR," which represents the organization
in memory 12 of data which may be accessed
by an individual serial processor.
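
The two views described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the memory 12 is modeled as a list of 32-bit integers (one per slice address) and that a serial processor assembles its word least-significant bit first; the function names and the bit ordering are hypothetical and are not taken from the specification.

```python
# Hypothetical model: memory 12 as a list of 32-bit integers, one per slice.
WIDTH = 32  # data lines D(31:0)

def read_slice(memory, address):
    """Slice view: all 32 bits stored at one address, as placed on D(31:0)."""
    return memory[address] & 0xFFFFFFFF

def read_processor_word(memory, line, base, nbits):
    """Processor view: the bits the serial processor on data line D(line)
    sees over `nbits` successive addresses (assumed least-significant first)."""
    word = 0
    for k in range(nbits):
        bit = (memory[base + k] >> line) & 1
        word |= bit << k
    return word

memory = [0b0001, 0b0011, 0b0001, 0b0000]
print(read_slice(memory, 1))                 # slice at address 1
print(read_processor_word(memory, 0, 0, 4))  # word seen on D(0)
print(read_processor_word(memory, 1, 0, 4))  # word seen on D(1)
```

In this model a slice is read in a single memory access, whereas a processor-view word requires one access per bit, which is why processor-format data must be transposed before it can be handled in parallel form.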

The router nodes 15 of all of the processing
nodes are interconnected to facilitate transfer of
messages among the processing nodes 10 comprising
the array. Each message includes an address
to identify a processing node 10 and serial
processor 14 that is the intended recipient of the
message, and data. In one particular embodiment
the router nodes are interconnected in the form of
a hypercube, as described in the aforementioned
Hillis patents. Each router node 15H and 15L, under
control of RTR CTRL router control signals
from the micro-controller 5, transmits messages to
other router nodes 15 on other processing element
chips 11 over a plurality of communications links
identified by reference numerals HC_O_H(11:0)
and HC_O_L(11:0), respectively.

In addition, each router node 15H and 15L
receives messages from communications links
identified by reference numerals HC_I_H(11:0)
and HC_I_L(11:0), respectively. The router nodes
15 determine from the address of each received
message whether the message is intended for a
serial processor 14 on the processing node 10 and,
if so, couple it onto a data line D(i) of data bus 13
over which the serial processor 14 that is to receive
the message accesses the memory 12. The
micro-controller 5 generates MEM ADRS memory address
and MEM CTRL memory control signals to
facilitate the storage of the data from the message
in the memory 12. On the other hand, if a router
node 15 determines that a message is not intended
for a serial processor 14 on the processing node
10, it transmits it over one of the communications
links HC_O_H(11:0) and HC_O_L(11:0) as
determined by the message's address.

The various communications links HC_O_H(11:0),
HC_O_L(11:0), HC_I_H(11:0) and
HC_I_L(11:0) connected to each processing node
10 are connected to diverse ones of other processing
nodes in a conventional manner to effect the
hypercube interconnection. Thus, the outgoing
communications links identified by reference numerals
HC_O_H(11:0) and HC_O_L(11:0) correspond
to various incoming communications links,
which may be identified by reference numerals
HC_I_H(11:0) and HC_I_L(11:0), at router nodes
15 of other processing nodes 10. In one embodiment
the circuitry of the router nodes 15H and 15L
is similar to that described in the aforementioned
Hillis patents and Hillis, et al., patent application
and will not be described further herein.
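
As a rough illustration of how outgoing links at one router node correspond to incoming links at others, the following sketch uses the textbook hypercube convention, which is an assumption here and is not spelled out in this description: router nodes carry binary addresses, and link d joins the two router nodes whose addresses differ only in bit d. The names `neighbor` and `all_neighbors` are hypothetical.

```python
DIMENSIONS = 12  # twelve links per router node, numbered (11:0)

def neighbor(address, link):
    """Address of the router node reached over outgoing link `link`,
    assuming link d joins nodes whose addresses differ only in bit d."""
    if not 0 <= link < DIMENSIONS:
        raise ValueError("link must be in the range 0..11")
    return address ^ (1 << link)

def all_neighbors(address):
    """The twelve router nodes directly connected to `address`."""
    return [neighbor(address, d) for d in range(DIMENSIONS)]

print(all_neighbors(0)[:4])
# Following the same link back returns to the starting node, so outgoing
# link d at one node pairs with incoming link d at its neighbor.
print(neighbor(neighbor(5, 7), 7))
```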

The processing nodes 10 may also have an
auxiliary processor 20 that processes data in memory
12 that may be organized either in slice format
or in processor format, and a transposer module 21
to interface the auxiliary processor 20 to the data
bus 13. The auxiliary processor 20 may be, for
example, a floating point processor, which may
perform arithmetic and logic operations in connection
with data in floating point data format. The
auxiliary processors 20 and transposer modules 21
in the various processing nodes 10 operate in
response to AP INSTR auxiliary processor instruction
signals and XPOSER CTRL transposer control
signals, respectively, from the micro-controller 5.
As is the case with the other control signals provided
by the micro-controller 5, the micro-controller
5 transmits the AP INSTR auxiliary processor
instruction signals and the XPOSER CTRL transposer
control signals to control the auxiliary processor
20 and transposer module 21 of all of the
processing nodes 10 concurrently, enabling them
to generally perform the same operation concurrently.

The transposer module 21 includes several 
transposer circuits, two of which, identified by ref- 
erence numerals 22 and 23, are shown in Fig. 1. 
Transposer 22 receives input data from an input 
multiplexer 24 and stores it in one of a plurality of 
slots identified by the contents of a write pointer 
register 25. The register 25 may be provided with a 
pointer prior to storing each item of data in a slot in 
the transposer 22. Alternatively, the register may 
be loaded with an initial value before loading any 
data in the transposer 22 and then incremented for 
each successive item of data loaded therein. The 
input multiplexer 24, under control of the XPOSER 
CTRL transposer control signals, selectively cou- 
ples data signals to the transposer 22 from either 
the data bus 13 or from a bus 26. Bus 26 carries 
AP IN (31:0) auxiliary processor in signals repre- 
senting processed data from the auxiliary proces- 
sor 20. The transposer module 21 also includes an 
input multiplexer 27 and write pointer register 28 
which selectively controls storage of data in the 
transposer 23 in the same manner. 

The transposers 22 and 23 operate in response
to the XPOSER CTRL transposer control signals to 
generate transpositions of the data stored therein. 
The transposer module 21 also includes two output 
multiplexers 30 and 31, also controlled by the 
XPOSER CTRL transposer control signals, which 
control the transfer of transposed data onto a bus 
32 for transmission to the auxiliary processor 20 or 
onto the data bus 13 for transmission to the mem- 
ory 12 or to the PE chips 11. Multiplexer 30 re- 
ceives data signals from the output terminals of 
transposers 22 and 23 and selectively couples the 
signals from one of the transposers onto the data 
bus 13. Similarly, the multiplexer 31 receives data 
signals from the output terminals of transposers 22 and 23
and selectively couples the signals from one of the 
transposers onto the bus 32 for transmission to the 
auxiliary processor. 

Although not shown in Fig. 1, the processing 
node 10 may also provide a direct (that is, non- 
transposing) path between the data bus 13 and the 
auxiliary processor 20. It will be appreciated that 
the transposer module 21 facilitates the transposi- 
tion of data stored in the memory 12 in processor 
format, which would be transmitted serially over
separate lines of the data bus 13, into parallel
format for processing by the auxiliary processor 20.
If the data is stored in memory 12 in slice format,
transposition is not required. In addition, the
transposer module 21 receives processed data from
the auxiliary processor 20 and, if it is required that
it be stored in the memory 12 in processor format,
transposes the data for transmission serially over
predetermined lines of the data bus 13. If the
processed data from the auxiliary processor 20 is
to be stored in the memory 12 in slice format, the
data may be transmitted by the auxiliary processor
20 to the memory 12 over the non-transposing
path.

In accordance with the invention, the transposer
module 21 is also used to provide transposed
data, originally stored in the memory 12 in
slice format, for transmission by the router nodes
15 of the processing elements 11, facilitating the
transfer of data, in slice format, between processing
nodes 10 over the various communications
links interconnecting the router nodes 15. To
accommodate this operation, since the micro-controller
enables the processing nodes 10 to transmit
and receive contemporaneously, one of the
transposers, namely transposer 22, of the transposer
module 21 in each processing node 10 will be
designated a transmit transposer and be used for
transmission, and the other transposer, namely
transposer 23, will be designated a receive
transposer and be used for reception.

The detailed operations by which data slices
are transferred between processing nodes 10 will
be described in connection with Figs. 2A and 2B,
which contain flow diagrams describing transmission
and reception of the data, respectively, and
Figs. 3A and 3B, which contain diagrams illustrating
the organization of the data in the transmit
transposer 22 and receive transposer 23, respectively.
Preliminarily, the transfer of data slices between
processing nodes 10 proceeds in three general
sequences. First, the micro-controller 5, in a
series of iterations, enables the processing nodes
10, in unison, to transfer data slices from the memory
12 to the transmit transposer 22 (steps 101
through 103, Fig. 2A). Thereafter, the micro-controller
5 enables the processing nodes 10 to iteratively
transmit, and contemporaneously to receive, the
data over the communications links, and to load the
received data into the receive transposers 23
(steps 104 through 106, Fig. 2A, and steps 111
through 114, Fig. 2B). Thus, while the flow diagrams
depicting transmission and reception are
shown in separate Figures, it should be appreciated
that the micro-controller 5 will enable transmission
(steps 104 through 106, Fig. 2A) and reception
(steps 111 through 114, Fig. 2B) contemporaneously
on an interleaved basis. During reception,
a processing node 10 loads the received data
in its receive transposer 23. After the receive
transposers 23 have been filled, the micro-controller 5,
in a series of iterations, enables the processing
nodes 10 to transfer the contents of the receive
transposers 23 to the respective memories 12
(steps 116 and 117, Fig. 2B).

More specifically, with reference to Figs. 1 and
2A, initially the memory 12 includes a set of
transposer slot pointers ("XPOSER SLOT PTRS") and
the data slices to be transmitted ("XMIT DATA").
The transposer slot pointers contain, in successive
slices in memory 12, pointers to locations, identified
as slots, in transmit transposer 22 in which
successive data slices of the memory 12 are to be
stored. As will be described below in connection
with Fig. 3A, the transposer slot pointers effectively
select the particular data line D(i) of bus 13 over
which the transmit transposer will couple each data
slice, which, in turn, selects the particular
communications link HC_O_H(11:0) or HC_O_L(11:0)
over which each data slice will be transmitted.
Since the communications links are connected to
different processing nodes 10 in the array, the
transposer slot pointers effectively select the processing
node 10 to receive each data slice comprising
the transmit data.

As noted above, the micro-controller 5 enables
loading of the transmit transposer 22 in a series of
iterations. In each iteration, the micro-controller 5
generates MEM ADRS memory address signals
and XPOSER CTRL transposer control signals that,
in each processing node 10, (1) enable the memory
12 to couple a transposer slot pointer onto the
data bus 13 and (2) enable the transposer module
21 to load the pointer on the data bus into the write
pointer register 25 (step 101). In the first iteration,
during step 101 the MEM ADRS memory address 
signals point to the first location in memory 12 
which contains a transposer slot pointer, and in 
successive iterations the MEM ADRS memory ad- 
dress signals point to successive slices in memory 
12, which contain the successive transposer slot 
pointers. 

During each iteration, after enabling a transposer
slot pointer to be loaded into the write
pointer register 25, the micro-controller 5 generates
MEM ADRS memory address signals which point
to a location in memory 12 containing a transmit
data slice, and XPOSER CTRL transposer control
signals that, in each processing node 10, (1) enable
the memory 12 to couple a data slice onto bus 13,
and (2) enable the transposer module 21 to couple
the data slice on the data bus 13 through multiplexer
24 and into the slot identified by the pointer
in the transmit write pointer register 25 (step 102).
In the first iteration, during step 102 the MEM ADRS
memory address signals point to the first location
in memory 12 containing transmit data, and in
successive iterations the MEM ADRS memory address
signals point to successive slices in memory
12.

After enabling a data slice to be loaded into the
transmit transposer 22, the micro-controller determines
whether the transmit transposer 22 has been
filled (step 103), that is, if the transmit transposer
22 has a data slice which can be transmitted over
each of the communications links HC_O_H(11:0)
and HC_O_L(11:0). If not, the micro-controller
returns to step 101 to begin another iteration. If the
micro-controller 5 determines that the transmit
transposer has been filled, it sequences to step 104 to
begin transmitting the data therefrom.
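
The loading sequence of steps 101 through 103 can be sketched as follows. This is a simplified software model under stated assumptions, not the circuit itself: memory 12 is a Python list, the transposer slot pointers and the transmit data occupy two contiguous regions starting at the hypothetical addresses `ptr_base` and `data_base`, and "filled" is approximated by a fixed number of slices, `count`.

```python
NUM_SLOTS = 32  # slots 50(0) through 50(31)

def load_transmit_transposer(memory, ptr_base, data_base, count):
    """Model of steps 101-103: for each iteration, fetch a slot pointer,
    then store the next data slice in the slot that pointer selects."""
    slots = [None] * NUM_SLOTS
    loaded = 0
    while loaded < count:                      # step 103: filled yet?
        write_pointer = memory[ptr_base + loaded]        # step 101
        slots[write_pointer] = memory[data_base + loaded]  # step 102
        loaded += 1
    return slots

# Three slot pointers followed by three data slices (illustrative values).
memory = [0, 16, 5, 0xAAAA, 0xBBBB, 0xCCCC]
slots = load_transmit_transposer(memory, ptr_base=0, data_base=3, count=3)
print(slots[0], slots[16], slots[5])
```

Because the slot chosen for a slice determines the data line, and hence the communications link, on which it later leaves the node, the pointer table in memory acts as a per-slice destination selector.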

Before proceeding further, it would be helpful
to describe the contents of transmit transposer 22
after it has been filled. With reference to Fig. 3A,
the transmit transposer includes a series of slots
50(0) through 50(31) [generally identified by reference
numeral 50(i)], each of which stores one data
slice transmitted thereto over data lines D(31:0)
comprising data bus 13. The slot 50(i) in which a data
slice is stored is identified by the pointer stored in
the transmit write pointer register 25. As noted
above, during each iteration the pointer in register
25 is provided in step 101, prior to loading of the
slot in step 102.

In one embodiment the transmit transposer 22
is filled when it contains data slices in at most slots
50(0) through 50(11) and slots 50(16) through
50(27). Since each of the router nodes 15L and 15H
in each PE chip 11 is connected to twelve output
communications links HC_O_L(11:0) and
HC_O_H(11:0), in that embodiment data slices
from only twenty-four slots, such as slots 50(0)
through 50(11) and 50(16) through 50(27), can be
transmitted contemporaneously. In that case, the
transmit transposer 22 contains a data slice to be
transmitted over each of the communications links,
as shown in Fig. 3A; if data slices are stored in
other slots 50(i) they will not be transmitted in that
embodiment.

It will be appreciated that, depending on the
particular computation being performed by the
computer system, the transmit transposer 22 may
be deemed "filled," such that transmission can
occur, if fewer than all of the slots 50(0) through
50(11) and 50(16) through 50(27) contain data
slices to be transmitted. For example, in performing
a "NEWS" transmission between the processing
nodes 10 and their respective four or six nearest
neighbors, only four or six slots 50(i) need
contain data slices to be transmitted. In that case,
the transposer slot pointers that are iteratively loaded
into the transmit write pointer register 25 may
be used to select the appropriate slots 50(i) in
transmit transposer 22 so that the data slices will
be transmitted to the appropriate nearest neighbor
processing nodes 10.

Returning to Fig. 2A, after the micro-controller
5 determines that the transmit transposer has been
filled, it initiates a series of iterations, each iteration
comprising steps 104 through 106, to facilitate
transmission of the data from the transmit
transposer 22 over the communications links. In this
operation, the micro-controller iteratively enables
the transmission of sequential bits concurrently
from all of the data slices stored in the transmit
transposer 22. That is, during each iteration "i," the
micro-controller 5 generates XPOSER CTRL
transposer control signals that enable the transmit
transposer 22 to couple a transmit transpose word
through multiplexer 30 onto data bus lines 13 (step
104). The transmit transpose word during iteration
"i" comprises the "i-th" bits in all of the slots 50 in
the transmit transposer. With reference to Fig. 3A,
during each iteration the data bit from slot 50(i) is
transmitted onto data line D(i) of the data bus 13.
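
Generation of a transmit transpose word (step 104) can be sketched as below. The sketch assumes that each slot holds a 32-bit integer (or `None` if empty) and that iteration i sends bit i counting from the least-significant end; the bit ordering is an assumption. To avoid overloading the index "i", the code uses `i` for the iteration and `n` for the slot and data-line number.

```python
def transpose_word(slots, i):
    """Transmit transpose word for iteration i: bit i of the data slice in
    slot 50(n) is placed on data line D(n). Empty slots contribute 0."""
    word = 0
    for n, data_slice in enumerate(slots):
        if data_slice is not None:
            word |= ((data_slice >> i) & 1) << n
    return word

slots = [None] * 32
slots[0] = 0b01    # slice in slot 50(0)
slots[1] = 0b11    # slice in slot 50(1)
slots[16] = 0b10   # slice in slot 50(16)

print(bin(transpose_word(slots, 0)))  # bit 0 of every slice
print(bin(transpose_word(slots, 1)))  # bit 1 of every slice
```

Each word therefore carries one bit of every stored slice, so a complete slice travels bit-serially down a single data line over successive iterations.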

After data has been transmitted onto the data
bus 13, the micro-controller 5 generates RTR
CTRL router control signals that enable the router
nodes 15H and 15L (Fig. 1) to transmit the bits on
lines D(11:0) and D(27:16) onto the communications
links HC_O_L(11:0) and HC_O_H(11:0),
respectively (step 105). Thereafter, the micro-controller
5 determines whether all of the data has
been transmitted from the transmit transposer 22 
(step 106), and if not, it returns to step 104 to 
enable transmission of the next transmit transpose 
word. If, on the other hand, the micro-controller 5 
determines in step 106 that all of the data has been 
transmitted from the transmit transposer, it exits 
the transmission sequence (step 107). 

It will be appreciated that the number of iterations
of steps 104 through 106 that are required to
transmit the data from the transmit transposer 22
corresponds to the number of bits of data in a data
slice stored in the transmit transposer 22. The
maximum number of transmit transpose words that
the transmit transposer 22 can provide corresponds
to the maximum number of bits in a data slice to
be transmitted, which is thirty-two in one embodiment.
Thus, in determining whether all of the data
has been transmitted from the transmit transposer
(in connection with step 106) the micro-controller 5
can use an iteration counter to count iterations of
steps 104 through 106, and exit when the iteration
counter counts to a value corresponding to the
number of bits in a data slice, or to a value
corresponding to the number of bits to be transmitted
if less than all bits are to be transmitted.
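
The transmission loop of steps 104 through 106, including the iteration counter, might be modeled as follows. This is a sketch under the same assumptions as before (integer slots, least-significant bit first); `split_for_routers` and `transmit` are hypothetical names, and the returned list merely records what each router node would place on its twelve outgoing links in each iteration.

```python
def split_for_routers(word):
    """Step 105: router node 15L takes lines D(11:0) for HC_O_L(11:0);
    router node 15H takes lines D(27:16) for HC_O_H(11:0)."""
    low_links = word & 0xFFF
    high_links = (word >> 16) & 0xFFF
    return high_links, low_links

def transmit(slots, nbits=32):
    """Steps 104-106: one transpose word per iteration, counted by an
    iteration counter that stops after `nbits` iterations."""
    sent = []
    for i in range(nbits):                     # iteration counter
        word = 0
        for n, data_slice in enumerate(slots):
            if data_slice is not None:
                word |= ((data_slice >> i) & 1) << n   # step 104
        sent.append(split_for_routers(word))           # step 105
    return sent                                        # step 106 satisfied

slots = [None] * 32
slots[0] = 0b101    # leaves on HC_O_L(0)
slots[16] = 0b011   # leaves on HC_O_H(0)
slots[12] = 0b111   # slot 50(12) has no link, so it is never sent

print(transmit(slots, nbits=3))
```

Passing a smaller `nbits` corresponds to the case in which fewer than all bits of each slice are to be transmitted.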

The sequence enabled by the micro-controller 
5 in connection with reception of the transmitted 
data will be described in connection with Rgs. 2B 
and 3B. As noted above, the micro-controller 5 will 
enable the processing nodes 10 to transmit and 
receive on an interleaved basis, that is, when the 
micro-controller 5 enables the router nodes 15H 
and 15L of the processing nodes 10 to transmit bits 
of a transpose word onto the communications links 
HC_O_H(11:0) and HC_O_L(11:0) during one
iteration, it also enables the processing nodes 10 to
receive the bits from the communications links
HC_I_H(11:0) and HC_I_L(11:0) during a contemporaneous
iteration of the receive sequence.
Thus, at least a portion of the receive sequence
depicted on Fig. 2B will occur contemporaneously
with the transmission sequence depicted on Fig. 
2A. 

With reference to Fig. 2B, reception by the
processing nodes 10 of bits from the communications
links proceeds in a series of iterations, comprising
steps 112 through 115, each reception iteration
occurring after data bits have been coupled
onto the communications links during a transmission
iteration (steps 104 through 106, Fig. 2A). This
allows the processing nodes 10 to receive the bits
being transmitted thereto during the transmission
iteration. During the successive reception iterations,
the processing nodes 10 receive successive bits of
the data slices from the other processing nodes
connected thereto. In each iteration, each processing
node 10 receives bits from corresponding bit
locations in the data slices. In the successive
iterations, each processing node 10 normally will store
the bits in successive slots of the receive transposer
23. Thus, initially the micro-controller 5 generates
XPOSER CTRL transposer control signals
that enable the transposer module 21 in each
processing node 10 to initialize its write pointer
register 28 so as to point to the first slot of the receive
transposer 23 (step 111).

After initializing the write pointer register 28 of
each processing node 10, the micro-controller 5
initiates the sequential reception iterations, each
comprising steps 112 through 115, to load received
data into the receive transposer 23. During each
iteration, the micro-controller 5 generates RTR
CTRL router control signals that enable the router
nodes 15H and 15L of the processing nodes 10 to
receive the data bits then on communications links
HC_I_H(11:0) and HC_I_L(11:0), respectively,
and to couple them onto lines D(27:16) and D(11:0)
of the data bus 13 (step 112). Thereafter, the
micro-controller 5 generates XPOSER CTRL transposer
control signals that enable the multiplexer
27 to couple the signals on lines D(31:0) of the
data bus 13 to the receive transposer 23, and the
receive transposer 23 to store them in the slot in
receive transposer 23 (step 113) identified by the
contents of the write pointer register 28.
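
The reception iterations of steps 111 through 115 can be modelled roughly as below. The sketch assumes a software stand-in for one processing node: the function names, the `link_samples` list, and the use of a plain Python list for the receive transposer slots are hypothetical. Only the mapping of HC_I_H(11:0) onto D(27:16) and HC_I_L(11:0) onto D(11:0) is taken from the description.

```python
LINKS_PER_ROUTER = 12
LINK_MASK = (1 << LINKS_PER_ROUTER) - 1


def couple_links_to_bus(hc_i_h, hc_i_l):
    """Step 112: place HC_I_H(11:0) on D(27:16) and HC_I_L(11:0) on D(11:0)."""
    return ((hc_i_h & LINK_MASK) << 16) | (hc_i_l & LINK_MASK)


def run_reception(link_samples, n_slots=32):
    """Model steps 111-115 for one node.

    link_samples is a hypothetical list of (hc_i_h, hc_i_l) pairs, one
    pair per reception iteration.
    """
    if not 0 < len(link_samples) <= n_slots:
        raise ValueError("need between 1 and n_slots reception iterations")
    slots = [0] * n_slots
    write_pointer = 0  # step 111: point at the first slot
    for count, (hc_i_h, hc_i_l) in enumerate(link_samples, start=1):
        bus = couple_links_to_bus(hc_i_h, hc_i_l)  # step 112
        slots[write_pointer] = bus  # step 113
        if count == len(link_samples):  # step 114: transposer filled
            break
        write_pointer += 1  # step 115
    return slots
```

Bus lines D(15:12) and D(31:28) are left at zero in this model, since the description assigns no link bits to them.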

With reference to Fig. 3B, as is the case with
transmit transposer 22, the receive transposer 23
includes a plurality of slots, identified as slot 60(0)
through 60(31) [generally identified by reference
numeral 60(i)]. Slot 60(i) in the receive transposer
23 is loaded with the data bits received during the
"i-th" reception iteration. In the successive
iterations, bits from each of the communications links
HC_I_H(11:0) and HC_I_L(11:0) are coupled to
the same bit locations in the successive slots 60.
Thus, as shown in Fig. 3B, the data slices from the
processing nodes 10 connected thereto are found
in the same bit location in the successive slots in
the receive transposer 23. It will be appreciated
that each transpose word provided by the receive
transposer 23 comprises the bits from the same bit
locations in successive slots 60, which, as noted
above, correspond to the successive bits of a data
slice transmitted to the processing node 10.
Accordingly, the transpose words in the receive
transposer, which, as described below, will be stored
as data slices in the memory 12 of receiving
processing node 10, correspond to the data slices in
memory 12 of the processing nodes 10 that
transmitted them thereto.
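
The way the receive transposer rebuilds data slices from the slot contents can be illustrated with a short sketch. It is a simplified model under stated assumptions: the function name is hypothetical, slots are plain integers, and the i-th received bit is assumed to become bit i of the rebuilt slice (the description does not fix the bit order).

```python
def receive_transpose_words(slots, n_lines):
    """Return one transpose word per data-bus line.

    slots[i] holds the bus word stored during the i-th reception
    iteration; bit `line` of every slot came from the same link, so
    gathering that bit across the slots rebuilds the slice sent on it.
    """
    words = []
    for line in range(n_lines):
        word = 0
        for i, slot in enumerate(slots):
            # Assumed bit order: the i-th received bit becomes bit i.
            word |= ((slot >> line) & 1) << i
        words.append(word)
    return words
```

If two iterations stored the words 0b011 and 0b110, the three transpose words are 0b01, 0b11 and 0b10, that is, the slices that the three neighbouring nodes sent over their respective links.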

Returning to Fig. 2B, after step 113 the
micro-controller 5 then determines whether the receive
transposers 23 in the processing nodes 10 have
been filled (step 114), and, if not, enables the
processing nodes 10 to increment the receive write
pointer stored in their registers 28 (step 115). The
receive transposer 23 will be filled if the number of
reception iterations enabled by the micro-controller 5
corresponds to the number of bits in a data slice,
or a lesser number if fewer than all bits of the data
slices are to be transmitted. If the micro-controller
5 determines that the receive transposers 23 have
not been filled, it returns to step 112 to initiate
another reception iteration.

On the other hand, if the micro-controller 5
determines, in step 114, that the number of reception
iterations it has enabled during a message
transfer cycle corresponds to the number of data
bits in a data slice, it steps to a sequence, comprising
steps 116 and 117, in which it enables the
processing nodes 10 to transfer the contents of
their respective receive transposers 23 to their
memories 12. In this operation, the micro-controller
5 generates (i) MEM ADRS memory address signals
that identify a location in the receive data
region of memory 12, (ii) XPOSER CTRL transposer
control signals that enable the receive transposer
23 to couple a transpose word through
multiplexer 30 onto data bus 13, and (iii) MEM
CTRL memory control signals to enable the data
represented by the signals on data bus 13 to be
stored in memory 12 (step 116). The micro-controller
5 then determines whether it has enabled storage
of all of the transpose words from the receive
transposer 23 in the processing nodes 10 in their
respective memories 12 (step 117). If the
micro-controller 5 makes a negative determination in step
117, it returns to step 116 to enable storage of the
next transpose word from receive transposers 23 in
respective memories 12. However, if the
micro-controller 5 makes a positive determination in step
117, it exits (step 120).
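
The store loop of steps 116 and 117 can be sketched as follows. The sketch assumes, purely for illustration, that memory is a Python dictionary keyed by address and that successive transpose words go to consecutive addresses starting at a hypothetical `base_address`; the description says only that MEM ADRS identifies a location in the receive data region.

```python
def store_transpose_words(memory, base_address, transpose_words):
    """Model steps 116-120: copy each transpose word into memory."""
    stored = 0
    while stored < len(transpose_words):  # step 117: all words stored yet?
        # Step 116: address a location, then write the word from the bus.
        memory[base_address + stored] = transpose_words[stored]
        stored += 1
    return memory  # step 120: exit
```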

It will be appreciated that the initialization (step 
111) and incrementation (step 115) of the write 
pointer registers 28 that control storage of data in 
the receive transposers 23 of the respective
processing nodes 10 are performed if the data bits of
the slices received by the respective processing 
nodes are to be stored in the same order as they 
were transmitted. Depending on the computation 
being performed, it may be desirable to change the 
order of the bits, such as to interchange bytes 
(eight-bit sections) of the data slices. In that case, 
slot pointers similar to the transposer slot pointers 
used in connection with the transmit sequence (Fig. 
2A) may be provided in memory 12, which may be 
loaded into the write pointer register 28 prior to 
loading of received data into the receive transposer 
23, in a manner similar to step 101 of the transmit 
sequence (Fig. 2A). If sections or groups of bits in
the received data are to be interchanged, pointers 
may be provided for the first locations in the re- 
ceive transposers 23 in which the data bits are to 
be stored, which may be incremented for succes- 
sive locations in the section. 
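
The section-pointer variant described above might look like the following sketch. The names `load_with_section_pointers`, `section_starts` and `section_len` are hypothetical, and the model assumes every section has the same length; it shows one way a start pointer per section, incremented within the section, can interchange groups of received bits.

```python
def load_with_section_pointers(bus_words, section_starts, section_len=8):
    """Store received words section by section.

    section_starts[k] is the slot in which the first word of the k-th
    received section is stored; the write pointer is then incremented
    for the remaining words of that section.
    """
    slots = [0] * (len(section_starts) * section_len)
    if len(bus_words) != len(slots):
        raise ValueError("expected one bus word per slot")
    received = 0
    for start in section_starts:
        write_pointer = start  # pointer loaded for the first location
        for _ in range(section_len):
            slots[write_pointer] = bus_words[received]
            received += 1
            write_pointer += 1  # incremented for successive locations
    return slots
```

With two-word sections and start pointers `[2, 0]`, the received words `[10, 11, 12, 13]` land in the slots as `[12, 13, 10, 11]`, so the two sections are interchanged. For 32-bit slices with byte interchange, the analogous call would use `section_len=8` and four start pointers.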

The communications arrangement provides a
number of advantages. First, it facilitates the transfer
of data organized in slice format among the
processing nodes 10, which was not the case in
the systems described in the aforementioned Hillis
patents and Hillis, et al., patent application. In addition,
in a number of circumstances, the communications
arrangement can facilitate transfer of data
at a higher rate than in either the global router or
the NEWS mechanism of the system described in
the Hillis patents and Hillis, et al., patent application.
In particular, although one embodiment uses
the same wires and router node circuitry as the
global router described in the Hillis patents, the
arrangement can transfer data at a higher rate at
least since the data being transferred does not
include addressing information.

In addition, the communications arrangement 
described herein facilitates faster transfer than the 
NEWS mechanism described in the Hillis patents,
since data can be transferred with all of the nearest 
neighbors at the same time, whereas with the 
NEWS mechanism data can only be transferred in 
one direction, with one nearest neighbor, at a time. 
Furthermore, while the NEWS mechanism facili- 
tates transfers with nearest neighbors in only a two- 
or three-dimensional array pattern, the communica- 
tion arrangement can facilitate transfer in array 
patterns in two, three and higher dimensions, which 
can be useful in a number of computations. 

The foregoing description has been limited to a
specific embodiment of this invention. It will be
apparent, however, that variations and modifications
may be made to the invention, with the attainment
of some or all of the advantages of the
invention. Therefore, it is the object of the appended
claims to cover all such variations and modifications
as come within the true spirit and scope of
the invention.

Claims 

1. A massively parallel processing system com- 
prising: 

A. a plurality of processing nodes intercon- 
nected by a plurality of communications 
links, each processing node comprising: 

i. a memory for storing data in slices; 

ii. a transposer module connected to the 
memory for generating transpose data 
words of selected ones of the data slices 
from the memory; 

iii. a router node connected to the transposer
module and to the communications
links for transferring transpose data
words over the communications links to
thereby transfer the data slices between
processing nodes; and

B. a controller for controlling the memories, 
transposer modules and router nodes of the 
processing nodes in parallel, thereby to fa- 
cilitate transfer of data slices among the 
processing nodes in unison. 

2. A massively parallel processing system as re- 
cited in claim 1 in which the router node also 
receives data slices from the communications 
links, and the transposer module also gen- 
erates transpositions of the data slices re- 
ceived by the router node for storage in the 
memory. 

3. A massively parallel processing system as recited
in claim 2 in which, in each processing
node, the memory and the transposer module
are interconnected by a bus comprising a plurality
of data lines, and the memory stores
each data slice in one of a plurality of storage
locations each identified by an address, the
controller coupling addresses to the memories
to identify locations in the memories of data
slices to be transmitted.

4. A massively parallel processing system as recited
in claim 2 in which each transposer
module includes:

A. a transmit transposer for receiving data 
slices from the memory and for generating 
in response thereto transmit transpose 
words for transmission to the router node; 
and 

B. a receive transposer for receiving data
slices from the router node and for generating
in response thereto receive transpose
words for storage in the memory.

5. A massively parallel processing system as recited
in claim 4 in which the transmit transposer
includes a plurality of slots, each slot
being associated with a communications link,
the transposer module further includes a transmit
write pointer register for storing a pointer to
a slot in the transmit transposer in which a
data slice is to be stored, thereby associating
the data slices with respective communications
links over which they are to be transmitted, the
controller enabling establishment of the pointer
value in the transmit write pointer register.

6. A massively parallel processing system as recited
in claim 5 in which the memory in each
processing module stores successive transmit
pointers, the controller successively enabling
successive ones of the transmit pointers to be
transferred to the transmit write pointer register
to control storage of successive data slices in
the slots of the transmit transposer in the processing
nodes.

7. A massively parallel processing system as recited
in claim 4 in which the receive transposer
includes a plurality of slots, the transposer
module of each processing node further including
a receive write pointer register for storing
a pointer to a slot in the receive transposer
in which data received by the router node is to
be stored, the controller initializing the receive
write pointer register to point to the first slot of
the receive write pointer register and to
iteratively increment for storage of successive
data words from the router node.

8. A massively parallel processing system comprising:

A. a plurality of processing nodes interconnected
by a plurality of communications
links, each processing node comprising:

i. a memory including a plurality of storage
locations, each identified by an address,
for storing data slices and transmit
slot pointers;

ii. a router node connected to the transposer
module and to the communications
links for transferring and receiving
transpose data words over the communications
links to thereby transfer the
data slices between processing nodes;

iii. a transposer module comprising:

a. a transmit transposer for receiving
data slices from the memory and for
generating in response thereto transmit
transpose words for transmission to the
router nodes, the transmit transposer including
a plurality of slots, each slot being
associated with a communications
link;

b. a transmit write pointer register for
storing a pointer to a slot in the transmit
transposer in which a data slice is to be
stored, thereby associating the data
slices with respective communications
links over which they are to be transmitted;

c. a receive transposer including a plurality
of slots for receiving data slices from
the router node and for generating in
response thereto receive transpose
words for storage in the memory;

d. a receive write pointer register for
storing a pointer to a slot in the receive
transposer in which data received by the
router node is to be stored; and

B. a controller for controlling the memories,
transposer modules and router nodes of the
processing nodes in parallel, thereby to facilitate
transfer of data slices among the
processing nodes in unison, the controller
generating addresses for transmission to
the memories to identify locations in the
memories of data slices to be transmitted,
the controller further successively enabling
successive ones of the transmit pointers to
be transferred to the transmit write pointer
register to control storage of successive
data slices in the slots of the transmit transposer
in the processing nodes, and the
controller initializing the receive write pointer
register to point to the first slot of the
receive write pointer register and iteratively
incrementing it to control storage of successive
data words from the router node in
successive slots of the receive transposer.






EP 0 456 201 A2 



[Fig. 1: block diagram; legible labels include MSG OUT, ROUTER NODE (HIGH ORDER) 15H, SLICE, and HC_I_H(11:0)]




101 SET MEM ADRS TO ADDRESS OF NEXT
TRANSPOSER SLOT POINTER IN MEMORY AND
TRANSFER OVER DATA BUS TO TRANSMIT
WRITE POINTER REGISTER

102 SET MEM ADRS TO ADDRESS OF NEXT TRANSMIT
DATA SLICE AND TRANSFER TO SLOT OF
TRANSMIT TRANSPOSER IDENTIFIED BY
CONTENTS OF TRANSMIT WRITE POINTER
REGISTER

103 HAS TRANSMIT TRANSPOSER BEEN FILLED ?   NO

YES

104 ENABLE TRANSMIT TRANSPOSER TO COUPLE
TRANSPOSE WORD ONTO DATA BUS

105 ENABLE ROUTERS TO TRANSMIT RESPECTIVE
BITS ON DATA BUS LINES D(27:16) AND
D(11:0) ON ROUTER OUTPUT LINES
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111 INITIALIZE CONTENTS OF RECEIVE WRITE 
POINTER REGISTER 

112 ENABLE ROUTERS TO RECEIVE BITS ON
ROUTER INPUT LINES HC_I_H(11:0) AND
HC_I_L(11:0) AND COUPLE THEM ONTO DATA
BUS LINES D(27:16) AND D(11:0),
RESPECTIVELY

113 ENABLE RECEIVE TRANSPOSER TO RECEIVE
BITS ON DATA BUS LINES D(31:0) AND LOAD
THEM INTO SLOT OF RECEIVE TRANSPOSER
IDENTIFIED BY CONTENTS OF RECEIVE WRITE
POINTER REGISTER

114 HAS THE RECEIVE TRANSPOSER
BEEN FILLED ?   YES

NO

115 INCREMENT CONTENTS OF RECEIVE WRITE
POINTER REGISTER

116 SET MEM ADRS TO ADDRESS OF NEXT
RECEIVE DATA SLICE AND TRANSFER NEXT
TRANSPOSE WORD FROM RECEIVE
TRANSPOSER THERETO

117 HAVE ALL TRANSPOSE WORDS BEEN
TRANSFERRED FROM RECEIVE TRANSPOSER ?

YES

120 EXIT
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